ABSTRACT



The present invention is a system, method, and program 
product that comprises a computer with a collection of 
documents to be searched. The documents contain free form 
(natural language) text. We define a set of labels called 
QA-Tokens, which function as abstractions of phrases or 
question-types. We define a pattern file, which consists of a 
number of pattern records, each of which has a question 
template, an associated question word pattern, and an asso- 
ciated set of QA-Tokens. We describe a query-analysis 
process which receives a query as input and matches it to 
one or more of the question templates, where a priority 
algorithm determines which match is used if there is more 
than one. The query-analysis process then replaces the 
associated question word pattern in the matching query with 
the associated set of QA-Tokens, and possibly some other 
words. This results in a processed query having some 
combination of original query tokens, new tokens from the 
pattern file, and QA-Tokens, possibly with weights. We 
describe a pattern-matching process that identifies patterns 
of text in the document collection and augments the location 
with corresponding QA-Tokens. We define a text index data 
structure which is an inverted list of the locations of all of 
the words in the document collection, together with the 
locations of all of the augmented QA-Tokens. A search 
process then matches the processed query against a window 
of a user-selected number of sentences that is slid across the 
document texts. A hit-list of top-scoring windows is returned 
to the user. 

13 Claims, 12 Drawing Sheets 
SYSTEM, METHOD AND PROGRAM
PRODUCT FOR ANSWERING QUESTIONS 
USING A SEARCH ENGINE 

This application claims the benefit of provisional appli- 
cation 60/161,427 filed Oct. 26, 1999. 

FIELD OF THE INVENTION 
This invention relates to the field of querying and search- 
ing collections of text. More specifically, the invention 
relates to querying and searching collections of text in a 
networking environment.

BACKGROUND OF THE INVENTION 
Text documents contain a great deal of factual informa- 
tion. For example, an encyclopedia contains many text 
articles consisting almost entirely of factual information. 
Newspaper articles contain many facts along with descrip- 
tions of newsworthy events. The World Wide Web contains 
millions of text documents, which in turn contain at least a 
small amount of factual information. 

Given this collection of factual information, we naturally 
desire the ability to answer questions based on this infor- 
mation using automatic computer programs. Previously two 
kinds of computer programs have been created to search 
factual information: database management systems and 
information retrieval systems. A database management sys- 
tem (DBMS) assumes that information is stored in a struc- 
tured fashion, such that each data element has a known data 
type and a set of legal operations. For example, the relational 
database management system (RDBMS) provides the Struc- 
tured Query Language, or SQL, which specifies a syntax and 
grammar for the formulation of queries against the database. 
SQL is based on a relational calculus that restricts queries to 
include only certain operations on certain data types, and 
certain combinations of those operations. 

A relational database is tailored to applications where the 
factual information is available in a structured form. To 
address the factual information contained in free text 
documents, information retrieval (IR) systems were created. 
An information retrieval system indexes a collection of 
documents using textual features (e.g., words, noun phrases, 
named entities, etc.). The document collection can then be 
searched using either Boolean queries or natural language 
queries. A Boolean query consists of textual features and 
Boolean operators (e.g., and, or, not). To evaluate a Boolean 
query, an IR system returns the set of documents that 
satisfies the Boolean expression. A natural language query is 
a free form text query that describes the user's information 
need. Documents likely to satisfy this information need are 
then found using a retrieval model that matches the query 
with documents in the collection. Popular models include 
the probabilistic and vector space models, both of which use 
text feature occurrence statistics to match queries with 
documents. In all cases, an IR system only identifies entire 
documents that are likely to satisfy a user's information 
need. 

Ideally, the user would be able to phrase a specific 
question, e.g., "What is the population of the world?", and
the computer program would respond with a specific answer, 
e.g., "6 billion". Moreover, the computer program will 
produce these answers by analyzing the factual information 
available in the vast supply of text documents, examples of 
which were given previously. Thus, the problem at hand is 
how to automatically process free text documents and pro-
vide specific answers to questions based on the factual
information contained in the analyzed documents. 
STATEMENT OF PROBLEMS WITH THE
PRIOR ART 

When users have an information need, search engines are 
typically used to find the desired information in a large 
collection of documents. The user's query is treated as a bag 
of words, and these are matched against the contents of the 
documents in the collection; the documents with the best 
matches, according to some scoring function, are returned at
the top of a hit-list. Such an approach can be quite effective
when one is looking for information about some topic.
However, if one desires an answer to a question, a different 
approach has to be attempted for the following reasons: (1) 
Using a standard search engine approach, the user gets back 
documents, rather than answers to a question. This then 
requires browsing the documents to see if they do indeed 
contain the desired answers (which they may not), which can
be a time-consuming process. (2) No attempt is made to even
partially understand the question, and make appropriate 
modifications to the processing. So, for example, if the 
question is "Where is XXX", the word "where" will either 
be left intact and submitted to the search engine, which is a 
problem since any text giving the location of XXX is very 
unlikely to include the word "where", or the word will be 
considered a stop-word and stripped out, leaving the search 
engine with no clue that a location is sought. 

The above discussion describes the most commonly
found situation. Two approaches have been
used in an attempt to provide better service for the end-user.
The first of these does not directly use search engines at
all, and is currently in use by AskJeeves
(www.askjeeves.com). This approach uses a combination of
databases of facts, semantic networks, ontologies of terms
and a way to match users' questions to this data to convert
the user's question to one or more standard questions. Thus
the user will ask a question, and the system will respond with
a list of questions that the system can answer. These latter
questions match the user's question in the sense that they
share some keywords in common. A mapping exists between
these standard questions and reference material, which is
usually in the form of topical Web pages. This is done by
generating for these pages one or more templates or
annotations, which are matched against the user's questions.
These templates may be either in natural-language or
structured form. The four major problems with this approach are:

(1) Building and maintaining this structure is extremely
labor-intensive and potentially error-prone, and is certainly
subjective.

(2) When new textual material (such as news articles)
comes into existence, it cannot automatically be incorporated
in the "knowledge-base" of the system, but must wait
until a human makes the appropriate links, if at all. This
deficiency creates a time-lag at best and a permanent hole in
the system's capability at worst. Furthermore, using this
approach, only a pointer to some text where the answer may
be found is given instead of the answer itself. For instance,
asking "How old is President Clinton?" returns a set of links
to documents containing information about President
Clinton; however, there is no guarantee that any of these
documents will contain the age of the President. Generating
these templates automatically cannot be done accurately
with the current state of the art in automatic text
understanding.

(3) It can easily happen that there is no match between the
question and pre-stored templates; in such cases these prior
art systems default to standard (non-Question-Answering)
methods of searching.

(4) There is no clear way to compute the degree of
relevance between the question and the text that is returned,
so it is not straightforward to determine how to rank-order
these texts.

The second approach uses traditional search engines with
post-processing by linguistic algorithms, and is the default
mechanism suggested and supported by the TREC-8
Question-Answering track. In this approach, a question is
submitted to a traditional search engine and documents are
returned. A number of these documents will be false hits, for
reasons outlined earlier. Linguistic processing is then applied
to these documents to detect one (or more) instances of text
fragments that correspond to an answer to the question in hand.
The thinking here is that it is too computationally expensive to
apply sophisticated linguistic processing to a corpus that
might be several gigabytes in size, but it is reasonable to
apply such processing to a few dozen or even a few hundred
documents that come back at the top of the hit list. The
problem with this approach, though, is that, again for
reasons given earlier, even the top documents in the hit-list
so generated might not contain the sought-after answers. In
fact, there may well be documents in the corpus that do answer
the question, but that score so poorly using traditional
ranking algorithms that they fail to appear in the top section
of the hit-list that is subject to the linguistic processing.

The prior art can answer structured questions (e.g., SQL)
posed against structured data but cannot deal with
unstructured questions or data. EasyAsk (TM)
(www.EasyAsk.com) is a system which (after some training
on the underlying database) takes questions posed in plain
English and translates them into an SQL query which then
retrieves the data. The answers to the questions are
constrained to values as stored in the database.

OBJECTS OF THE INVENTION

An object of this invention is an improved system and
method for determining specific answers from queries of
text.

An object of this invention is an improved system and
method for determining answers from queries against free
form text.

An object of this invention is an improved system and
method for determining answers using free form queries.

SUMMARY OF THE INVENTION

The present invention is a system, method, and program
product that comprises a computer with a collection of
documents to be searched. The documents contain free form
(natural language) text. We define a set of labels called
QA-Tokens, which function as abstractions of phrases or
question-types. We define a pattern file, which consists of a
number of pattern records, each of which has a question
template, an associated question word pattern, and an
associated set of QA-Tokens. We describe a query-analysis
process which receives a query as input and matches it to
one or more of the question templates, where a priority
algorithm determines which match is used if there is more
than one. The query-analysis process then replaces the
associated question word pattern in the matching query with
the associated set of QA-Tokens, and possibly some other
words. The tokens in the processed query (the words and
QA-Tokens) are optionally converted to lemma form,
stop-words are optionally removed and tokens are optionally
assigned weights. This results in a processed query having
some combination of original query tokens, new tokens
from the pattern file, and QA-Tokens, possibly with weights.
We describe a pattern-matching process that identifies
patterns of text in the document collection and augments the
location with corresponding QA-Tokens. We define a text
index data structure which is an inverted list of the locations
of all of the words in the document collection, together with
the locations of all of the augmented QA-Tokens. A search
process then matches the processed query against a window
of a user-selected number of sentences that is slid across the
document texts. Each window is scored by the
number of matches of words in the window with words in
the processed query, with weighting if desired. A hit-list of
top-scoring windows is returned to the user.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing and other objects, aspects and advantages
will be better understood from the following detailed
description of preferred embodiments of the invention with
reference to the drawings, which include the following:

FIG. 1 is a block diagram of the computing environment
in which the present invention is used in a non-limiting
preferred embodiment.

FIG. 2 is a block diagram of the system architecture.

FIG. 3 is a block diagram of a Pattern-file.

FIG. 4 is a flow chart of the Matching Algorithm.

FIG. 5 is a flow chart of the Priority Algorithm.

FIG. 5a is a block diagram of the Equivalence File.

FIG. 6 is a block diagram of a text index data structure.

FIG. 7a is a diagram of a text data structure.

FIG. 7b is a diagram of pattern files.

FIG. 7c is a flow chart of a document analysis process.

FIG. 7d is a flow chart of an indexing process.

FIG. 8 is a flow chart of a search process.

DETAILED DESCRIPTION OF THE
INVENTION

In general, the invention first analyzes the query with a
pattern-matching program which, on recognizing certain
patterns of words in a query, replaces some
of them with members of a specially-designed set of query
tokens called QA-Tokens. As a simple example, the
PLACE$ token is substituted for the word "where". The
premise of this procedure is that a quite high percentage of
questions can be analyzed in such a manner.

The text collection from which the answer is to be derived
is also augmented with QA-Tokens, but by using a different
recognizer than that used to analyze the question. The search
engine indexing process is modified to identify potential
answers to questions (e.g. places, people, numeric quantities
etc.) and index them. The scoring function is modified to
score individual sentences, or small sequences of sentences,
and return these to the user.

The invention uses one or more of the following
observations:

(1) The answers to many questions issued against
collections of text documents can be found in noun,
prepositional and adverbial phrases,

(2) These phrases can be typed by a set of a dozen or so
labels (such as PERSON$, PLACE$, MONEY$),

(3) A large percentage of questions can be classified and
mapped to these phrase types (e.g. a "where" question
is seeking a PLACE$), and

(4) For many questions, all or nearly all of the question
terms will be found clustered in passages of just one or
a small number of sentences, in those documents that
answer the question.

The key consequence of the prominence of phrases is that,
for the most part, phrases can be detected in text with a
relatively simple pattern-matching algorithm, certainly a
task far short of full text understanding. The implementation
of the solution is expressed in modifications to three of the
components of a traditional search engine solution, namely
query-analysis, text parsing and indexing, and scoring.

The query analysis is enhanced by developing a set of
question-templates that are matched against the user's query,
with substitution of certain query terms with special
query-tokens that correspond to the phrase labels mentioned
above. So for example, the pattern "where ..." causes the word
"where" to be replaced with PLACE$. The pattern "how
much does ... cost" causes those terms to be replaced with
MONEY$. The pattern "how old ..." causes a replacement
with AGE$. The base set of such labels is: PLACE$,
PERSON$, ROLE$, NAME$, ORGANIZATION$,
DURATION$, AGE$, DATE$, TIME$, VOLUME$,
AREA$, LENGTH$, WEIGHT$, NUMBER$, METHOD$,
MOST$, RATE$ and MONEY$. More specific versions of
these, such as STATE$, COUNTRY$, CITY$, YEAR$, can
be used as long as the phrase analyser (discussed below) can
recognize such quantities.

A synonym operator @SYN( ) is used to deal with cases
where a question could be validly matched against more
than one type of phrase. Thus a "who" question could match
a proper name, a profession or an organization, so will
generate @SYN(PERSON$, ROLE$, ORGANIZATION$)
in the modified query.

The indexer runs its own pattern template matcher against
the text in the documents to be indexed. For each of the
phrase types, a set of patterns needs to be developed. For
example, the following are some of the TIME$ phrases:

    in the afternoon
    in the morning
    in CARDINAL hours

(where CARDINAL is a cardinal number), and so on.
Clearly, to avoid a huge list (consider that instead of "hours"
in the last example, almost any word indicating a period of
time could be substituted), a mechanism for concisely
expressing such variants and for efficient performance of the
matching is desirable. This is the subject of a companion
disclosure "An Efficient and Flexible Phrase Recognizer",
but is not required for the correct functioning of the present
invention.

Whenever the indexer succeeds in matching a phrase
pattern in the text, the corresponding special query
token (such as TIME$ or PLACE$) is generated and indexed
at that point in the document, along with the individual terms
that comprised the phrase. We call this process of adding
extra indexing terms augmentation. All terms in the
document not matched in this way are indexed in the usual way
too.

The search engine operates essentially by the usual
bag-of-words matching technique, but is subtly affected by the
presence of the special query tokens. Thus the query "When
did the Challenger explode" gets translated on query
analysis to the bag {@SYN(TIME$, DATE$) Challenger
explode}, which matches best against locations in the index
that contain (exactly or variants of) the word Challenger, the
word explode and either a TIME$ or a DATE$ token,
meaning some phrasal expression of a time or date. The
special query token is not allowed to match against any text
token that is already matched against another query term.
Thus the question "Who was Napoleon's wife" will match
against a passage containing "Napoleon", "wife" plus some
term other than Napoleon that is augmented with PERSON$.

The last-mentioned improvement is to the scoring
algorithm. Search engines score documents based on
how many of the query terms they contain and how often,
with a contribution computed for each term
based on document and collection statistics. It is our
observation that when a document successfully answers a
question, all of the components of the question are to be
found together, usually within a sentence or two. We
modify the scoring algorithm to score sentences (or short
sequences of them) rather than documents. Due to the more
severe filtering constraints imposed by this (i.e. all, or
most of the query terms must occur in a sentence or short
sequence of them, rather than the document as a whole),
a less complicated scoring function will likely suffice; for
example, a simple count of the number of query terms that
matched.

The scoring procedure can be made more sophisticated
and more flexible by extending the query syntax to allow
specification of weights to individual query terms; on
matching, the terms will contribute to the sentence or
sentence-sequence score in proportion to the given weight.
This allows finer scoring by taking care of the
following kinds of considerations:

Proper names and other multi-word terms in the query are
rarer than individual words, so their presence in answer
sentences gives more confidence that the sentence is correct
than the presence of single terms from the query, all other
things being equal, so they should be weighted higher.

Proposed answer text is no answer if it doesn't contain a
type-compatible match to a special query-token in the query,
so special query tokens should be weighted higher than any
other words in the query.

Some of the alternatives in the @SYN-sets may be more
desirable than others. For example, "when" might generate
@SYN(TIME$, DATE$), where DATE$ matches specific
dates (e.g. "Jul. 4, 1776") but TIME$ matches more general
expressions ("in the afternoon"); a DATE$ match is
therefore usually more desirable than a TIME$ match, so should
be weighted more.

How the user interface presenting the results should look is
a question of user preference, to be determined. In one
design, document titles are returned in a browser as in a
traditional hit-list, but the document's score is inherited
directly from its best-achieving sentence (or
sentence-sequence). On clicking on the document title, the document
is fetched and scrolled to the location of the best sentence,
highlighting the sentence through colour, font, point-size,
shading or other markings. In another design the hit-list
consists of the best-matching sentences. Clicking on them
fetches the documents they belong in, as before.

A more detailed description of the invention is now
presented in relation to the Figures.

FIG. 1 is a block diagram of the computing environment
in which the present invention is used in a non-limiting
preferred embodiment. The figure shows some of the
possible hardware, software, and networking configurations that
make up the computing environment.

The computing environment or system 100 comprises one
or more general purpose computers 170, 175, 180, 185, 190
and 195 interconnected by a network 105. Examples of
general purpose computers include the IBM Aptiva personal
computer, the IBM RISC System/6000 workstation and the
IBM POWERparallel SP2. (These are Trademarks of the
IBM Corporation.) The network 105 may be a local area
network (LAN), a wide area network (WAN), or the Internet.
Moreover, the computers in this environment may support
the Web information exchange protocol (HTTP) and be part
of a local Web or the World Wide Web (WWW). Some
computers (e.g., 195) may occasionally or always be
disconnected 196 from the network and operate as stand-alone
computers.

In a preferred embodiment, the present invention is
implemented as a software component in the computing
environment. The software component is called the Question
Answering System 110. The Question Answering System
answers questions by processing information contained in
documents 140. Documents 140 are items such as books,
articles, or reports that contain text. One or more documents
are stored on one or more computers in the environment.
One or more documents may be grouped together to form a
document database 141. A document database 141 may
comprise documents located anywhere in the computing
environment, e.g., spread across two or more computers.
The Question Answering System analyzes the documents in
a document database (see FIG. 7) to create an index 130 for
the document database (see FIG. 6).

User questions are represented as queries (see FIG. 5) and
submitted to a Question Answering System 110 for
processing. The Question Answering System uses system data 120
(see FIGS. 3 and 4) and an index 130 to process the query
and locate relevant text passages in documents 140 (see FIG.
8). The relevant text passages may be further analyzed by the
Question Answering System to identify specific answers to
the question (see FIG. 9), and the results are returned to the
user.

Documents 140, indexes 130, and/or system data 120 on
one computer may be accessed over the network by another
computer using the Web (http) protocol, a networked file
system protocol (e.g., NFS, AFS), or some other protocol.
Services on one computer (e.g., question answering system
110) may be invoked over the network by another computer
using the Web protocol, a remote procedure call (RPC)
protocol, or some other protocol.

A number of possible configurations for accessing
documents, indexes, system data, and services locally or
remotely are depicted in the present figure. These
possibilities are described further below.

One configuration is a stand-alone workstation 195 that
may or may not be connected to a network 105. The
stand-alone system 195 has documents 140, system data
120, and an index 130 stored locally. The stand-alone system
195 also has a question answering system 110 installed
locally. When the system is used, a question is input to the
workstation 195 and processed by the local question
answering system 110 using system data 120 and the index 130. The
results from the question answering system are output by the
workstation 195.

A second configuration is 185, a workstation with
documents, system data, and indexes connected to a network
105. This configuration is similar to the stand-alone
workstation 195, except that 185 is always connected to the
network 105. Also, the local index 130 may be derived from
local documents 140 and/or remote documents accessed via
the network 105, and created by either a local question
answering system 110 or a remote question answering
system 110 accessed via the network 105. When queries are
input at the workstation 185, they may be processed locally
at 185 using the local question answering system 110, local
system data 120, and local index 130.
Alternatively, the local question answering system 110
may access remote system data 120 and/or a remote index
130 (e.g. on system 175) via the network 105. Alternatively,
the workstation 185 may access a remote question
answering system 110 via the network 105.

Another possible configuration is 175, a workstation with
an index only. Computer 175 is similar to computer 185 with
the exception that there are no local documents 140. The
local index 130 is derived from documents 140 accessed via
the network 105. Otherwise, as in computer 185, the index
130, system data 120, and question answering system 110
may be accessed locally or remotely via the network 105
when processing queries.

Another possible configuration is computer 180, a
workstation with documents only. The documents 140 stored
locally at computer 180 may be accessed by remote question
answering systems 110 via the network 105. When queries
are entered at computer 180, the question answering system
110, system data 120, and index 130 must all be accessed
remotely via the network 105.

Another possible configuration is computer 190, a client
station with no local documents 140, index 130, system data
120, or question answering system 110. When queries are
entered at computer 190, the question answering system 110,
system data 120, and index 130 must all be accessed
remotely via the network 105.

Another possible configuration is computer 170, a typical
web server. Queries are entered at another workstation (e.g.,
175, 180, 185, or possibly 195) or a client station (e.g., 190)
and sent for processing to the web server 170 via the
network 105. The web server 170 uses a remote question
answering system 110, system data 120, and index 130
(accessed via the network 105) to process the query.
Alternatively, one or more of these functions (110, 120 and
130) can reside on the web server 170. The results are
returned to the workstation or client station from which the
query was originally sent.

This general process of indexing is well known in the
prior art, but new details are disclosed in this invention in
both the indexing and search processes that are especially
suited to question-answering.

There are conceptually two parts to the system described
in this invention. FIG. 2 shows the back-end process 210 in
which the document collection is processed into indexes
and a run-time system 220 which takes a user's question and
returns answers by reference to the indexes produced by the
back-end system. We first provide an overview of these
processes, then a detailed description.

In the back-end process 210 the document collection 140
is tokenized by Tokenizer 720, which produces a Word-List
740 and a Collection Vocabulary 722. The Word-List 740 is
processed by Augmentor 730, producing an Augmented
Word-List 757. The Augmented Word-List is input to
Indexer 215, which produces Index 130.

The run-time system 220 takes as input the user's input
205, which is a text string representing the user's
information need in natural language. This is processed by the Query
Processor Stage I 232 with reference to a set of query
patterns 242 and equivalents 244. The output of this process
is a modified text string 245. This string 245 is further
manipulated by Query Processor Stage II 234, which by
reference to a collection-dependent vocabulary file 722 and
stop-word list 246 produces a further-modified query 247.
This query 247 undergoes a final transformation in Query
Processor Stage III 236, whereby operators directing the
search (such as indicating window-size) are inserted and a
search-ready query 250 is produced.
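
By way of illustration only, the second and third query-processing stages might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the stop-word list, the lemma table standing in for the vocabulary file 722, the lower-casing of words, and the `@WINDOW(n)` operator spelling are all hypothetical, since the text names these items without giving their contents or syntax.

```python
import re

# Hypothetical stand-ins: the disclosure names a stop-word list 246 and a
# collection-dependent vocabulary file 722 but does not give their contents.
STOP_WORDS = {"the", "a", "an", "of", "did", "does", "is", "was"}
LEMMAS = {"exploded": "explode", "explodes": "explode", "wives": "wife"}

# A term is an @SYN(...) group, a QA-Token ending in "$", or a plain word.
TERM_RE = re.compile(r"@SYN\([^)]*\)|[A-Z]+\$|\w+")


def is_qa_token(term):
    """QA-Tokens end in '$'; an @SYN(...) group bundles several of them."""
    return term.endswith("$") or term.startswith("@SYN(")


def stage_two(modified_query):
    """Drop stop-words and map remaining words to an assumed lemma form."""
    terms = []
    for term in TERM_RE.findall(modified_query):
        if is_qa_token(term):
            terms.append(term)  # QA-Tokens pass through untouched
            continue
        word = term.lower()
        if word not in STOP_WORDS:
            terms.append(LEMMAS.get(word, word))
    return terms


def stage_three(terms, window_size=1):
    """Prefix a hypothetical window-size operator to form the final query."""
    return f"@WINDOW({window_size}) " + " ".join(terms)


query_245 = "@SYN(TIME$, DATE$) did the Challenger explode"
query_250 = stage_three(stage_two(query_245), window_size=2)
print(query_250)
```

The QA-Tokens are passed through unchanged so that only ordinary words are subject to stop-word removal and lemmatization.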
The search-ready query 250 is then submitted to the
search engine process 800 which by reference to search 
index 130, previously generated by indexing process 215, 
produces a document hit-list 270. The document hit-list 
contains a rank-ordering of the documents that contain the
best passages that match the query, each entry containing, 
amongst other information, the offset in the document where 
the relevant passage is found. This hit-list is presented to the 
user. 
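
By way of illustration only, the window-based search just described might be sketched in Python as follows. The sketch assumes each document is already split into sentences held as sets of index terms (words plus any augmented QA-Tokens), and that each query term is a set of acceptable alternatives (modelling @SYN) with a weight. The names `Hit`, `score_window` and `search` are hypothetical, and the rule that a QA-Token may not match a text token already matched by another query term is not modelled.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    score: float
    doc_id: str
    first_sentence: int  # offset of the best window within the document


def score_window(window_terms, query):
    """Sum the weights of query terms with at least one alternative present."""
    return sum(weight for alternatives, weight in query
               if alternatives & window_terms)


def search(docs, query, window_size=2, top_k=10):
    """Slide a window of sentences over each document; keep its best window."""
    best = {}
    for doc_id, sentences in docs.items():
        last_start = max(len(sentences) - window_size, 0)
        for start in range(last_start + 1):
            window = set().union(*sentences[start:start + window_size])
            score = score_window(window, query)
            if score > 0 and (doc_id not in best or score > best[doc_id].score):
                best[doc_id] = Hit(score, doc_id, start)
    # The document hit-list is ordered by each document's best passage.
    return sorted(best.values(), key=lambda h: h.score, reverse=True)[:top_k]


docs = {
    "d1": [{"challenger", "explode", "DATE$"}, {"nasa", "investigate"}],
    "d2": [{"challenger", "launch"}, {"weather", "cold"},
           {"explode", "TIME$", "morning"}],
}
# QA-Token alternatives are weighted higher than ordinary words.
query = [({"TIME$", "DATE$"}, 2.0), ({"challenger"}, 1.0), ({"explode"}, 1.0)]
hits = search(docs, query, window_size=2)
```

Each document inherits the score of its best window, which matches the hit-list described above: documents are ranked by their best passage and each entry carries the passage offset.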

The Augmentation process 730 and the Query Processing 
230 operate on a collection of QA-Tokens 765 which are 
labels for different question types and the corresponding text 
segments in the document collection. A useful, but not 
necessarily complete, list of such tokens used in a preferred 
embodiment, along with an example of each, is presented 
here: 



QA-Token      Example
PLACE$        In the Rocky Mountains
COUNTRY$      United Kingdom
STATE$        Massachusetts
PERSON$       Albert Einstein
ROLE$         doctor
NAME$         the Shakespeare Festival
ORG$          the US Post Office
DURATION$     for 5 centuries
AGE$          30 years old
YEAR$         1999
TIME$         in the afternoon
DATE$         July 4th, 1776
VOLUME$       3 gallons
AREA$         4 square inches
LENGTH$       3 miles
WEIGHT$       25 tons
NUMBER$       1,234.5
METHOD$       by rubbing
RATE$         50 per cent
MONEY$        4 million dollars
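
By way of illustration only, the augmentation of document text with QA-Tokens such as those listed above, and the indexing of the result, might be sketched in Python as follows. The three regular expressions are hypothetical recognizers written for this sketch (the actual phrase recognizer is the subject of a companion disclosure), and the functions `augment` and `build_index` are assumed names.

```python
import re
from collections import defaultdict

# Hypothetical recognizers for three of the QA-Tokens listed above.
PATTERNS = [
    ("AGE$", re.compile(r"\b\d+ years old\b")),
    ("YEAR$", re.compile(r"\b(?:1[0-9]{3}|20[0-9]{2})\b")),
    ("TIME$", re.compile(r"\bin the (?:afternoon|morning)\b")),
]


def augment(text):
    """Return sorted (word position, term) pairs: all words plus QA-Tokens."""
    words = list(re.finditer(r"\w+", text))
    entries = [(i, w.group().lower()) for i, w in enumerate(words)]
    start_to_position = {w.start(): i for i, w in enumerate(words)}
    for token, pattern in PATTERNS:
        for match in pattern.finditer(text):
            # The QA-Token is placed at the position of the phrase's first
            # word; the individual words of the phrase stay indexed as well.
            entries.append((start_to_position[match.start()], token))
    return sorted(entries)


def build_index(docs):
    """Inverted list: term -> [(doc_id, word position), ...]."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, term in augment(text):
            index[term].append((doc_id, position))
    return index


index = build_index({"d1": "He was 30 years old in 1999."})
```

Because the QA-Token shares a location with the phrase it labels, a query containing AGE$ or YEAR$ can be matched against the index in the same way as an ordinary word.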



We now describe process 230 in which the user's query 205
is analyzed and transformed into a format suitable for
submission to the search engine process 800.

FIGS. 3, 4 and 5 support the description of how a given
question is analyzed and modified before being submitted to
a corpus in search of an answer. The premise is that using
advanced text-processing methods, certain words, phrases or
relations can be identified within a corpus. For instance the
phrase President Lincoln can be identified as a NAME of a
person; from the sentence "Lou Gerstner is the CEO of
IBM" it can be deduced that Lou Gerstner and IBM have a
relation, which is "CEO". The identification of a word,
phrase and relationship is discussed in depth in the literature.
However, some additional extensions are also covered in
this disclosure. Throughout this disclosure such
identifications are referred to as QA-Tokens [765]. The main idea
disclosed here is the following: Suppose the system can
automatically determine that the answer to a given question
is one or more of the specified QA-Tokens [765]. Then
submitting a bag of words which consists of the QA-Tokens
[765] and some (all) of the words in the question (which is
discussed in more detail later in this section) to the corpus
[...] answer is discussed in FIG. 8. The issue of determining a
single answer phrase from a set of returned text passages is
not covered by this disclosure.

The processing 230 of user input 205 to prepare it for
submission to the search engine 260 proceeds in three
stages.

Query Processing Stage I-232

In this section it is outlined how to analyze a question
phrased in plain English. The outcome of such an analysis is
a "bag of words" consisting of a set of QA-Tokens
[765] and some of the words in the question. These are the
steps in the analysis:

1) Determine the set of QA-Tokens [765] which describe
the answer.

2) Determine which words of the question should be
submitted to the search.

3) Determine which other words/phrases are relevant to
be submitted to the search.

STEP 1

Previously, a non-exhaustive list of QA-Tokens [765]
used in the system was enumerated. To determine a set of
QA-Tokens [765] which describe the answer to a question, a
Patterns-file [310] is used. A preferred embodiment of such
a Patterns-file [310] is shown in FIG. 3.

The data in the Patterns-file [310] is organized in six
columns. Only columns 330 and 332 are required to contain
data; the rest of the columns 331, 334, 336, 338 can be
empty. Each of the columns is described in turn. There is no
limit on the number of rows in 320. FIG. 3 shows six
different row pattern types which can be repeated an unlimited
number of times and are just examples of many different
possible patterns.

The first column 330 in FIG. 3 is labelled QA-Token [765]
and captures the type of answer a question is stipulated to
have. Entries in this column could be a single QA-Token
[765] or a set of QA-Tokens [765], as a question could
stipulate more than one type of answer. For example, the
answer to a question starting with "Name ..." could be a
person, an organization or anything else which has been
named. In cases where multiple QA-Tokens [765]
are possible answer tokens they are grouped together. A
preferred embodiment to denote this grouping is by
enclosing the equivalent QA-Tokens [765] in parentheses and
prefixing them with the operator @SYN. Column 331 describes
a weight for the different QA-Tokens [765] as described in
the first column 330. The weight of each QA-Token [765]
defaults to 1 if it is not specified. Otherwise, the order of the
weights in the second column corresponds to the order of the
QA-Tokens [765] in the first column. These weights may be
used in subsequent steps of the process described in the
disclosure, like the search process itself or the finding of the
most appropriate answer from the hit list.

How is it determined which QA-Tokens [765] are the
appropriate ones to capture the answers to a given question?

FIG. 4 shows one preferred embodiment of the algorithm.
The Patterns-file [310] as described in FIG. 3 is the input, as
is shown in box 410, as is the question Q as shown in box 405.
Once the Patterns-file [310] is empty the algorithm terminates
in box 420. In case the Patterns-file [310] is not empty, the
next pattern is retrieved in box 430 from the pattern file. The
Character c in column 332 of that pattern is determined in
[...] leaving a blank space in between and forming a string S. The
original question Q is shown in box 470. The actual
matching occurs in box 480 where it is tested whether S is a
substring of Q. [...] d could contain a special symbol. In this
preferred embodiment the symbol is a [...]. This symbol is a
"wildcard"; it can stand for an arbitrary number of characters
(including no character at all). If S is a substring of Q it is
determined whether the current pattern or a previously
matching pattern should be retained. This is shown in box 490.
The details of box 490 are shown in FIG. 5, which is the
priority algorithm.
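The matching loop of FIG. 4 can be sketched in Python as follows.
This is a minimal sketch under stated assumptions: the wildcard is
written here as "*" (the actual symbol is not legible in this
text), pattern records carry only the QA-Token, Character and
Detail columns, and the retention rule simply prefers the pattern
with more non-wildcard words, which is a simplification of the
priority algorithm of FIG. 5. The names Pattern, to_regex,
word_count and best_match are hypothetical.

```python
import re
from typing import NamedTuple, Optional, Sequence, Tuple


class Pattern(NamedTuple):
    qa_tokens: Tuple[str, ...]   # column 330
    character: str               # column 332
    detail: str = ""             # column 334 (may be empty)


def to_regex(s: str) -> "re.Pattern[str]":
    # Assumed wildcard "*": any number of characters, including none.
    parts = [re.escape(p) for p in s.split("*")]
    return re.compile(".*".join(parts), re.IGNORECASE)


def word_count(p: Pattern) -> int:
    # Count words in Character and Detail, ignoring wildcards.
    return len((p.character + " " + p.detail).replace("*", " ").split())


def best_match(question: str,
               patterns: Sequence[Pattern]) -> Optional[Pattern]:
    best = None
    for p in patterns:
        # Form S from c and d, leaving a blank space in between.
        s = (p.character + " " + p.detail).strip()
        if to_regex(s).search(question) is None:
            continue  # S is not a substring of Q
        # Simplified priority rule (assumption): more words wins.
        if best is None or word_count(p) > word_count(best):
            best = p
    return best


patterns = [Pattern(("NAME$",), "what"),
            Pattern(("YEAR$",), "what", "year")]
q = "In what year did Mozart compose Eine kleine Nachtmusik"
print(best_match(q, patterns))
```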
FIG. 5 describes the Priority Algorithm which is applied
if two different patterns match a given question. For example
there could be two patterns:

YEAR$ | what year

NAME$ | what

Suppose the question to be analyzed is "In what year did
Mozart compose Eine kleine Nachtmusik". Following the
algorithm as described in FIG. 4, both patterns would match.
Clearly, the pattern YEAR$ | what year is the more desirable
pattern as the token YEAR$ captures the essence of the
question. Hence a general Priority Algorithm is needed. The
conflict resolution is between the last pattern 510 matched
and the currently matched pattern 520. The last pattern 510
matched is initialized in the beginning to an empty string.
The Character and the Detail parts of the two patterns are
compared. Note that the wildcard characters are retained in
the Character and the Detail. However, this resolution
algorithm needs to be expanded as the following example shows:

The question is: What is the capital of this district?

The two patterns which match are:

CAPITAL$ what | capital

@SYN(PLACE$ NAME$) what | district

In this case the above described algorithm just exploring
the number of words in the Character and Detail not
including [...] of words in the Detail are the same, the whole
question has to be explored, which means that the words of the
question have to be substituted in the Character and the
Detail part. After that has been performed the number of
words in the Character and the Detail have to be explored:

Character 1: what

Detail 1: is the capital

Character 2: what

Detail 2: is the capital of this district

The same algorithm as described in FIG. 5 is applied to
these Character and Detail strings and a resolution is
guaranteed. After the words have been substituted in the
Character and the Detail, the number of words has to be
different in either the Character or the Detail part:

Explanation

For a question to match two different patterns and the
Character and the Detail of the two patterns having the same
number of words, the patterns have to contain wildcards.
The question itself would first match on the first word it
encountered in the pattern and then the second. Hence the
count of words would differ in this case.

The columns Remove 336 and Add 338 in the Patterns-File
as shown in FIG. 3 are explained now. At this stage, it
was determined which pattern matches a given question the
best and it will be referred to as P1. If the Remove column
336 contains a string R1, it is now removed from the P1
string. In case that the Add column 338 contains a string S1,
S1 is added to the P1 string.

In the Patterns-File 310 [FIG. 3] some of the strings have
a special symbol "_" as a prefix, which will be explained later
on in this section.

STEP 2

In this step it is determined which part of the question
should be submitted to the search engine. At this point a
"best-match" pattern has been determined and, if the Remove
336 and Add 338 columns were specified, the appropriate
substitutions made. Furthermore, the original question can
be parsed into four parts: the characteristic part, which
matches the Character of the pattern; the initial part, which
is that part of the question preceding the characteristic part
(and which can be the null string); the detail part, which
matches the Detail of the pattern; and the tail part, which is
that part of the string which follows the detail part. The
question which gets submitted for further processing in the
next couple of steps is the concatenation of the initial part,
the QA-Tokens [765] as determined by the "best-match"
pattern, detail and end. In essence the characteristic
part of the question gets replaced by appropriate
QA-Tokens [765], a fact which has to be taken into account
when designing the Patterns-File [FIG. 3].

STEP 3

In this step it is discussed how to address synonyms,
canonical forms and related phrases. Synonyms are words or
phrases which mean the same. For example, a question may
use a certain phrase, while the text from where the answer is
to be gotten may use a different phrase. To address this issue,
the Remove and Add columns as described in the Patterns-File
[FIG. 3] were introduced. However, there is a different
angle to synonyms: several patterns could be the same,
except for a single word or phrase. For example, the two
questions "What is the salary?" and "What is the compensation?"
beg for the same type of answer (and hence QA-Token
[765]), which is MONEY$.

Toward this end, a preferred embodiment of an Equivalence
File is shown in FIG. 5a. The first [...]
which start with an underscore symbol and is referred to
as the type. It points to a linked list of strings which are all
equivalent. In the Patterns-File [FIG. 3] [...] Hence the
question Q as shown in box 405 in FIG. 4 is not the original
question as posed by the user but is preprocessed. [...]
whether a word in the question is a member of one of the
linked lists; if so, it replaces the word by its type. In one
preferred embodiment, the data is not stored as a set of
linked lists but as an associative array. Using that data
structure, the "lookup" of the type of a word can be done
efficiently.

Query Processing Stage II-234

An aspect of the system is that the type of analysis on the
question is in sync with the type of analysis done on the
corpus of text where the answer is found. Towards this end
the following steps are performed in the question analysis if
they are performed in the indexing part:

1) find the lemma form of a given word

2) find all the equivalent canonical forms of a word or set
of words. In case there are several canonical forms for
a given phrase, they are enclosed in the same
@SYN( ) operator (as the equivalent QA-Tokens [765])
so as to make processing by the back end easier.

Another file used in the processing of the question is the
stop word file [246]. This is just a plain file consisting of
words which should be eliminated from a question as they
would not add any additional information and may actually
skew the search results. Typical examples are "and, or, is,
have, the". Words which are mentioned in the stop word file
[246] are removed from the question.

Another file, the collection vocabulary file [722], is used
in the system. Such a file is created using a vocabulary file
created during indexing time, which contains all the words
and some relevant statistics of the words in the collection.
Using a set of rules, a collection vocabulary file creator
determines which words are equivalent. For example it may
determine that President Bill Clinton and President William
Jefferson Clinton are the same person. The query analyzer
can then take each word in the query and determine from the
collection vocabulary file [722] all its equivalent words.
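The equivalence lookup and the stop-word removal described above
can be sketched in Python as follows, using an associative array
(a dict) for the equivalence data as in the preferred embodiment.
The type name _money, the sample entries and the helper names are
assumptions made for the sketch; only the stop-word examples are
taken from the text.

```python
# Hypothetical equivalence data: word -> type (leading underscore).
EQUIVALENCE = {
    "salary": "_money",
    "compensation": "_money",
}

# Stop words quoted in the text as typical examples.
STOP_WORDS = {"and", "or", "is", "have", "the"}


def preprocess(question):
    """Replace each word that belongs to an equivalence class by
    its type, leaving all other words unchanged."""
    words = question.rstrip("?").lower().split()
    return [EQUIVALENCE.get(w, w) for w in words]


def remove_stop_words(words):
    """Drop words listed in the stop word file."""
    return [w for w in words if w not in STOP_WORDS]


print(preprocess("What is the salary?"))
print(remove_stop_words(preprocess("What is the compensation?")))
```

Because both questions reduce to the same word sequence, a single
pattern written against the type can cover every member of the
equivalence class.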
When submitting a bag of words as previously described to
the search engine, all the equivalent words enclosed in a
@SYN operator are submitted too.

Query Processing Stage III-236

We describe here the final pre-processing of the query.
This stage consists of attaching weights to query terms and
[...]

There are many possible ways to indicate weighting. A
preferred embodiment is to use the @WEIGHT operator
which takes a weight and a term as arguments. Thus to
specify that the word "computer" in a query has weight N,
the term in the query becomes @WEIGHT(N computer).

[...] document frequency calculations. A preferred weighting
scheme for our invention is as follows:

QA-Tokens are assigned a weight of 4

Proper names are assigned a weight of 2

all other query terms are assigned a weight of 1.

A preferred embodiment of specification of window size
that is uniform with the @SYN and @WEIGHT syntax is to
use the @WINDOW operator. The first argument N is the
window size, the second is the entire query. A possible
modification of this scheme is to specify that the window is
dynamic-N, which will cause the search engine to try all
window sizes from 1 up to N. This is so that if all query
terms that match within a window of size N sentences
actually fall within a sub-window of M<N consecutive
sentences then the smaller window match will be the one
returned to the user.

Another modification of this scheme is to allow the user
to specify that matching be exclusive, that is, that no
QA-Token in the query matches a term in text that
already matches some other query term. A preferred
embodiment of this specification is to use the @EXECWIN operator
instead of the @WINDOW operator. As an example of all of
these features operating together, suppose that the user's
original query is "Who was the wife of Napoleon?".
Suppose a dynamic exclusive window of size 2 is desired. Then
the output of the Query Processing will be:

@EXECWIN(dynamic-2 @WEIGHT(4 @SYN(PERSON$ NAME$))
@WEIGHT(1 wife) @WEIGHT(2 Napoleon))

FIG. 6 is a block diagram of an index 130, which, at a
minimum, consists of an inverted file 600. An inverted file
600 contains an inverted list 610 for every indexable feature
identified during document analysis (see FIG. 7). An
inverted list 610 contains a document entry 620 for every
document in which the corresponding feature appears. A
document entry 620 contains a document identifier 621 that
identifies the corresponding document, and a location entry
630 for every occurrence of the feature in the document. A
location entry 630 identifies where in the document the
feature occurs, and includes a sentence number 631
identifying the sentence, a word number 632 identifying the word
in the sentence, and a length 633 indicating how many words
make up the feature. The features 607 in an inverted file 600
are typically organized in a dictionary 605, which allows the
inverted list 610 for a particular term 607 to be accessed.

FIG. 7 is a flowchart showing the method steps for
document analysis in one preferred embodiment of the
present invention.

Prior to being indexed, the text undergoes a process of
Augmentation via pattern-matching. This is performed by
well-known techniques in the prior art such as
string-matching and processing by finite-state-machines (FSMs).
These operations are not described in detail here, but the
data-structures used as reference and those produced as
output are described.

The first stage of the back-end process 710 is
Tokenization 720 which extracts the individual words from the input
[...] to them lexical features. A preferred
embodiment of the output of this stage is a linked Word List 740
of Word-List-Elements (WLEs) 745, with which are
associated any properties that the Tokenizer system 720 can
identify. Possible properties include but are not limited to:
part-of-speech, whether capitalized, lemma form. The
[...] 750 on the linked list. An augmentation is simply a new
fragment of the word-list 740 that replaces an existing
fragment, but keeps the old fragment accessible as a
side-chain. [...] shown in
FIG. 7a. Augmentations will be generated by the
Augmentation process described below and serve to
label segments of the word-list 740. Augmentations may be
applied to augmentations, if desired. A first round of
augmentations may be generated by the Tokenizer 720 itself.
For example, whole numbers may be identified during
tokenization and so labelled with the augmentation :CARDINAL
(the colon ':' is merely a syntactic construct to avoid
confusion with any instances of the actual word in text). Thus
any patterns 775 described below that involve cardinal
numbers, say, may employ the identifier :CARDINAL to
avoid having a specific pattern instance for every possible
numerical value. It does not matter that a given pattern can
match "bogus" text strings, for example "at :CARDINAL
o'clock" can match "at -57.56463 o'clock", since those text
strings will rarely if at all occur, and if they do then what is
expected of a system such as the present one is undefined.

The Augmentation process 730 takes as input a Word-List
740 and a QA-Token file 760. It is assumed that the
Augmentation process has the capability of accepting one or
more input patterns 775 and a text-stream 740, and
identifying any and all locations in the text-stream 740 where
instances of the patterns 775 occur. This is a standard
capability of text parsers described in the prior art. The
QA-Token file 760 consists of a number of records 762 each
of which has two fields: a QA-Token 765 and a pointer or
reference 767 to a QA-File. The QA-File consists of
one or more patterns 775 written in whatever syntax the
pattern-matcher in the Augmentor 730 requires. The
QA-Token is the identifier used to mark the location and
existence of the patterns 775. An example of a QA-Token
file and two QA-Files are shown in FIG. 7b.

It will be clear from the examples of patterns 775 shown
in FIG. 7b that given the large number of different
measurement systems (e.g. for TIME there are seconds, minutes,
days, . . . , centuries, millennia, . . . microseconds . . . , and
for WEIGHT there are pounds, ounces, tons, grams,
kilograms and so on) that unless some steps are taken the
number of patterns required will be enormous, especially if
a given pattern requires two or more such entities. Therefore
we anticipate a simple substitution scheme whereby
variables, for discussion's sake indicated syntactically by a
leading underscore ("_"), are defined to stand for a collection
of base terms. Thus the variable _WEIGHT might be
defined to stand for the set {pounds, ounces, tons} and
an individual pattern 775 might reference it by
":CARDINAL _WEIGHT", say, in so doing encoding all
patterns that consist of a cardinal number followed by one of
the given units of weight measurement. Such a mechanism
is useful for efficiency, and is present in the preferred
embodiment, but is not required for the correct operation of
this invention. Implementation of a parser that can accept
such variables in its patterns is standard in the prior art.

The operation of the Augmentation process 730 is
depicted in FIG. 7c. In step 780, WLE-pointer 781 is set to
point to the first WLE of Word-List 740. Step 782 iterates
through all records 762 in QA-Token file 760. Suppose a
particular record 762a is selected, consisting of QA-Token
765a and pointer 767a to QA-file 770a. Step 784 iterates
through every pattern 775 in QA-file 770a in turn. Suppose
a particular pattern 775b is selected. In step 786 the
pattern-matcher 730 attempts to match pattern 775b with the
Word-List 740 anchored at the point marked by the WLE-pointer
781. If a match occurs, step 788 is executed and an
augmentation 755b is added to the Word-List at point 781,
labelled with the current QA-Token 765a. Step 790 is then
executed in which it is tested to see if the Word-List Pointer
781 is at the end of the Word-List 740. If it is, then exit point
798 is reached, otherwise Word-List Pointer 781 is advanced
(step 792) and the execution returns to step 782. If in step
786 no match occurs, step 784 continues to iterate through
all patterns 775. If step 784 completes with no match, step
782 continues the iteration through the QA-Token file 760.
When the iteration in step 782 is finished, step 794 is
executed to see if the Word-List pointer 781 is at the end of
the Word-List 740. If it is, then exit point 798 is reached,
otherwise Word-List Pointer 781 is advanced (step 796) and
the execution restarts step 782. The output of this process is
an Augmented Word-List 757.

The use of a linked list with side-chains as an Augmented
Word-List 757 is a preferred embodiment of a representation
of features added to a sequence of text. Alternative means
include, but are not limited to, markup such as XML. Thus
the sentence depicted in FIG. 7a might become in XML:

<TEXT><PERSON>President William Clinton</PERSON>
lives in the <PLACE>White House</PLACE></TEXT>.

The following process (see FIG. 7d) which takes as input
the Augmented Word List 757 would be modified in a
straightforward way to accommodate any different but
functionally equivalent representation structure.

The Inverted File 600 is built from the Augmented
Word-List 757. This process is depicted in FIG. 7d. Step 710
initializes the process, setting the word list pointer to the first
entry in the augmented word list. Step 712 iterates over each
entry in the augmented word list. Every entry in the word list
(tokens, augmentations, and QA-Tokens) is considered an
indexable feature. In each case, the "canonical form" of the
indexable feature is used for indexing purposes (e.g., lemma
form, QA-Token string, etc.). In Step 714, the current word
list entry is looked up in the Dictionary 605. If it is not
found, a new entry for the indexable feature is created in the
dictionary in Step 716. In Step 718, the Inverted List 610 for
the current indexable feature is updated by adding a
Location Entry 630 to the Document Entry 620 at the end of the
inverted list. The location entry contains the sentence
number 631, word number 632, and length 633 of the indexable
feature. This information is obtained from the Word
List-Entry 745 for the indexable feature. Processing then iterates
at Step 712 until there are no more entries on the word list.
Processing completes in step 720, where the updated
inverted file 600 is saved.

FIG. 8 is a flowchart showing the method steps for search
in one preferred embodiment of the present invention. The
search process 800 starts with initialization step 810. During
initialization, the search query 250 is parsed, the indexed
features identified in the query are located in the index 600,
the corresponding inverted lists 610 are retrieved from the
index, and any synonym operators in the query are evaluated
by combining the inverted lists for features in the synonym.
After initialization, a loop to score all of the documents is
entered in 820. If there are more documents to score, scoring
of the next document proceeds in step 830. Otherwise
execution continues with step 880 (described below). In step
830, the current document to score is initialized, which
includes identifying the possible sentence windows within
the document to score. The number of sentences in the
sentence window is specified in the query 250.

The search process then enters a loop in 840 to score each
window in the current document. If there are no more
windows to score, the document is added to the hit-list (an
array of scored documents) in step 845, and execution
continues at 820. If there are more windows to score, the
current window is initialized in step 850. During step 850,
any occurrences of the features from the query 250 are
located in the current window, the current window's score is
initialized to 0, and a loop to score each feature is entered in
860.

If there are more features to score in the current window,
execution continues in step 870. In step 870, if the current
feature occurs in the current window, the window's score is
incremented by the weight of the feature. The weight may be
binary or a more complex tf*idf weight, and it may be
modified by a weighting factor specified in the query 250.
The preferred embodiment uses modified binary weights.
The features may also be found in the window in an
exclusive fashion, meaning that the word positions of the
features found in the current window may not overlap.
(Recall from FIG. 7 that the features indexed for a document
may have overlapping word positions if multiple features are
found for the same word during document analysis.)
Whether or not the features are found in an exclusive fashion
and the order in which they are found is specified in the
query 250.

When there are no more features to score in the current
window, execution continues in step 875. In step 875, the
window score may be modified by a density factor, where
the distance between the first query feature to appear in the
window and the last query feature to appear in the window
is measured and factored into the window score. In the
preferred embodiment, the density factor is computed as the
inverse of the measured distance and is added to the window
score. Thus, the smaller the distance (i.e., the more dense the
query features are in the window), the larger the window
score. The current window score is then compared with the
best window score so far for the current document. If the
current window score exceeds the document's best window
score, the document's best window score is updated with the
current window score. Execution then continues with the
next window for the current document in step 840.

When all documents have been scored, the hit-list of
documents is sorted by document score and the top n hits are
returned as Result List 270. The number of hits n to return
is specified with the query 250. Each hit identifies the
document, the document score, and the window in the
document that produced that score.

We claim:

1. A system for searching free form text comprising:

a computer with one or more memories and one or more
central processing units (CPUs), one or more of the
memories having one or more documents, the
documents containing a plurality of words in free form text,
the free form text having a natural language structure;
a pattern data structure having a plurality of pattern
records, each pattern record containing a question
template, an associated question word pattern, and an
associated set of QA-Tokens;

a query process that receives one or more queries as input
and matches one or more of the queries to one or more
of the question templates to determine one or more
template matches, the query process further replacing
the associated question word pattern in the matching
query with the associated set of QA-Tokens, being
processed query QA-Tokens, the query process creating
a processed query having the QA-Tokens and one
or more processed query words being the words of the
queries that were not replaced;

a text index data structure having a plurality of index
records, each index record having one or more index
words with one or more index word location in one or
more of the documents and further having one or more
index records with one or more index QA-Tokens with
one or more index QA-Token locations in one or more
of the documents, the index QA-Tokens being an
abstraction of one or more of the words; and

a searching process that matches one or more of the
process query words with one or more of the index
words and one or more of the processed query
QA-Tokens with one or more of the index QA-Tokens,
the index words and QA-Tokens being features, the
searching process further scoring one or more windows
by sliding the window over one or more sentences of
one or more of the documents, the score of the window
being dependent on the number of matching locations
in the window.

2. A system, as in claim 1, where the matching between 
patterns and queries contains a Priority Algorithm to deter- 
mine the best match. 

3. A system, as in claim 1, where the query process further
omits one or more of the processed query words as useless
words.

4. A system, as in claim 1, where the query process further
weights one or more of the processed query words.

5. A system, as in claim 1, where one or more of the
processed query words are substituted with one or more of
their canonical forms.

6. A system, as in claim 1, where one or more of the
processed query words are substituted with one or more of
their lemma forms.

7. A system, as in claim 1, where one or more of the 
processed query words are substituted with one or more of 
their equivalent forms. 

8. A system, as in claim 1, where one or more of the
question templates has one or more template variables.

9. A system, as in claim 8, where one or more of the
pattern records has a question template with one or more
template variables that defines a substitute template for
another question template.

10. A system, as in claim 8, where one or more of the
template variables changes to create one or more substitute
templates for another question templates.

11. A computer executed method for searching a plurality
of words in one or more documents in free form text, the free
form text having a natural language structure, the method
comprising the steps of:

receiving one or more queries as input;

matching one or more of the queries to one or more
question templates of a pattern record, the pattern
record further containing an associated question word
pattern and an associated set of QA-Tokens, the
matching determining one or more template matches, the
query process further replacing the associated question
word pattern in the matching query with the associated 
set of QA-Tokens, being processed query QA-Tokens, 
the query process creating a processed query having the 
QA-Tokens and one or more processed query words 
being the words of the queries that were not replaced; 
and 

a searching process that matches one or more of the 
process query words with one or more index words in 
an index record and one or more of the processed query 
QA-Tokens with one or more of the index QA-Tokens, 
in the index record, the index words and QA-Tokens 
being features, the searching process further scoring 
one or more windows by a sliding the window over the 
one or more sentences of one or more of the documents, 
the score of the window being dependent on the num- 
ber of matching locations in the window. 
12. A computer system for searching a plurality of words 
in one or more documents in free form text, the free form 
text having a natural language structure, the system com- 
prising: 

means for receiving one or more queries as input; 
means for matching one or more of the queries to one or 
more question templates of a pattern record, the pattern 
record further containing an associated question word 
pattern and an associated set of QA-Tokens, the match- 
ing determining one or more template matches, the 
query process further replacing the associated question 
word pattern in the matching query with the associated 
set of QA-Tokens, being processed query QA-Tokens, 
the query process creating a processed query having the 
QA-Tokens and one or more processed query words 
being the words of the queries that were not replaced; 
and 

means for matching one or more of the process query 
words with one or more index words in an index record 
and one or more of the processed query QA-Tokens 
with one or more of the index QA-Tokens, in the index 
record, the index words and QA-Tokens being features, 
the searching process further scoring one or more 
windows by a sliding the window over the one or more 
sentences of one or more of the documents, the score of 
the window being dependent on the number of match- 
ing locations in the window. 
13. A computer program product that performs the steps

matching one or more queries to one or more question 
templates of a pattern record, the pattern record further 
containing an associated question word pattern and an 
associated set of QA-Tokens, the matching determining 
one or more template matches, the query process fur- 
ther replacing the associated question word pattern in 
the matching query with the associated set of 
QA-Tokens, being processed query QA-Tokens, the 
query process creating a processed query having the 
QA-Tokens and one or more processed query words 
being the words of the queries that were not replaced; 
and 

a searching process that matches one or more of the 
process query words with one or more index words in 
an index record and one or more of the processed query 
QA-Tokens with one or more of the index QA-Tokens, 
in the index record, the index words and QA-Tokens 
being features, the searching process further scoring 
one or more windows by a sliding the window over the 
one or more sentences of one or more of the documents, 
the score of the window being dependent on the num- 
ber of matching locations in the window. 